Systems and Methods for Automatically Generating International Classification of Diseases Codes for a Patient Based on Machine Learning

ABSTRACT

A system for automatically assigning a set of ICD codes to a patient includes a diagnostic description encoding module and an ICD code assignment module. The diagnostic description encoding module is configured to obtain a diagnostic description vector from at least one diagnostic description record of the patient. The diagnostic description record may be in the form of hand-written physician notes. The ICD code assignment module is configured to apply a machine-learned ICD code assignment algorithm to the diagnostic description vector to assign a set of ICD codes to the patient. When multiple codes are assigned, the machine-learned ICD code assignment algorithm establishes an order of importance for the ICD codes.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of and priority to 1) U.S. Provisional Patent Application Ser. No. 62/699,385, filed Jul. 17, 2018, for “Diversity-Promoting and Large-Scale Machine Learning for Healthcare”, and 2) U.S. Provisional Patent Application Ser. No. 62/756,024, filed Nov. 5, 2018, for “Diversity-Promoting and Large-Scale Machine Learning for Healthcare”, the entire disclosures of which are incorporated herein by reference.

This application has subject matter in common with: 1) U.S. patent application Ser. No. 16/038,895, filed Jul. 18, 2018, for “A Machine Learning System for Measuring Patient Similarity”, 2) U.S. patent application Ser. No. 15/946,482, filed Apr. 5, 2018, for “A Machine Learning System for Disease, Patient, and Drug Co-Embedding, and Multi-Drug Recommendation”, 3) U.S. patent application Ser. No. ______, filed ______, for “Systems and Methods for Predicting Medications to Prescribe to a Patient Based on Machine Learning”, 4) U.S. patent application Ser. No. ______, filed ______, for “Systems and Methods for Medical Topic Discovery Based on Large-Scale Machine Learning”, and 5) U.S. patent application Ser. No. ______, filed ______, for “Systems and Methods for Automatically Tagging Concepts to, and Generating Text Reports for, Medical Images Based on Machine Learning”, the entire disclosures of which are incorporated herein by reference.

TECHNICAL FIELD

The present disclosure generally relates to machine learning for healthcare, and more particularly, to systems and methods that apply machine learning algorithms to diagnostic documents to automatically generate an international classification of diseases (ICD) code for a patient.

BACKGROUND

With the widespread adoption of electronic health records (EHR) systems, and the rapid development of new technologies such as high-throughput medical imaging devices, low-cost genome profiling systems, networked and even wearable sensors, mobile applications, and rich accumulation of medical knowledge/discoveries in databases, a tsunami of medical and healthcare data has emerged. It was estimated that 153 exabytes (one exabyte equals one billion gigabytes) of healthcare data were produced in 2013. In 2020, an estimated 2314 exabytes will be produced. From 2013 to 2020, the overall rate of increase is at least 48 percent annually.

In addition to the sheer volume, the complexity of healthcare data is also overwhelming. Such data includes clinical notes, medical images, lab values, vital signs, etc., coming from multiple heterogeneous modalities including texts, images, tabular data, time series, graphs and so on. The rich clinical data is becoming an increasingly important source of holistic and detailed information for both healthcare providers and receivers. Collectively analyzing and digesting this rich information generated from multiple sources; uncovering the health implications, risk factors, and mechanisms underlying the heterogeneous and noisy data records at both individual patient and whole population levels; and making clinical decisions including diagnosis, triage, and treatment thereupon, are now routine activities expected to be conducted by medical professionals including physicians, nurses, pharmacists and so on.

As the amount and complexity of medical data are rapidly growing, these activities are becoming increasingly difficult for human experts. The information overload makes medical analytics and decision-making time-consuming, error-prone, suboptimal, and less transparent. As a result, physicians, patients, and hospitals suffer a number of pain points, quality-wise and efficiency-wise. For example, in terms of quality, 250,000 Americans die each year from medical errors, which has become the third leading cause of death in the United States. Twelve million Americans are misdiagnosed each year. Preventable medication errors impact more than 7 million patients and cost almost $21 billion annually. Fifteen to twenty-five percent of patients are readmitted within 30 days and readmissions are costly (e.g., $41.3 billion in 2011). In terms of inefficiency, patients wait on average 6 hours in emergency rooms. Nearly 400,000 patients wait 24 hours or more. Physicians spend only 27 percent of their office day on direct clinical face time with patients. The U.S. healthcare system wastes $750 billion annually due to unnecessary services, inefficient care delivery, excess administrative costs, etc.

The advancement of machine learning (ML) technology opens up opportunities for next generation computer-aided medical data analysis and data-driven clinical decision making, where machine learning algorithms and systems can be developed to automatically and collectively digest massive medical data such as electronic health records, images, behavioral data, and the genome, to make data-driven and intelligent diagnostic predictions. An ML system can automatically analyze multiple sources of information with rich structure; uncover the medically meaningful hidden concepts from low-level records to aid medical professionals to easily and concisely understand the medical data; and create a compact set of informative diagnostic procedures and treatment courses and make healthcare recommendations thereupon.

It is therefore desirable to leverage the power of machine learning in automatically distilling insights from large-scale heterogeneous data for automatic smart data-driven medical predictions, recommendations, and decision-making, to assist physicians and hospitals in improving the quality and efficiency of healthcare. It is further desirable to have machine learning algorithms and systems that turn the raw clinical data into actionable insights for clinical applications. One such clinical application relates to assigning International Classification of Diseases (ICD) coding.

When applying machine learning to healthcare applications, several fundamental issues may arise, including:

1) How to better capture infrequent patterns: At the core of ML-based healthcare is the discovery of the latent patterns (e.g., topics in clinical notes, disease subtypes, phenotypes) underlying the observed clinical data. Under many circumstances, the frequency of patterns is highly imbalanced. Some patterns have very high frequency while others occur less frequently. Existing ML models lack the capability of capturing infrequent patterns. Known convolutional neural networks do not perform well on infrequent patterns. Such a deficiency of existing models possibly results from the design of the objective function used for training. For example, a maximum likelihood estimator would reward itself by modeling the frequent patterns well, as they are the major contributors to the likelihood function. On the other hand, infrequent patterns contribute much less to the likelihood, so it is not very rewarding to model them well and they tend to be ignored. Infrequent patterns are of crucial importance in clinical settings. For example, many infrequent diseases are life-threatening. It is critical to capture them.

2) How to alleviate overfitting: In certain clinical applications, the number of medical records available for training is limited. For example, when training a diagnostic model for an infrequent disease, typically there is no access to a sufficiently large number of patient cases due to the rareness of this disease. Under such circumstances, overfitting easily happens, wherein the trained model works well on the training data but generalizes poorly on unseen patients. It is critical to alleviate overfitting.

3) How to improve interpretability: Being interpretable and transparent is a must for an ML model to be willingly used by human physicians. Oftentimes, the patterns extracted by existing ML methods have a lot of redundancy and overlap, which makes them ambiguous and difficult to interpret. For example, in computational phenotyping from EHRs, it is observed that the phenotypes learned by the standard matrix and tensor factorization algorithms have much overlap, causing confusion, such as when two similar treatment plans are learned for the same type of disease. It is necessary to make the learned patterns distinct and interpretable.

4) How to compress model size without sacrificing modeling power: In clinical practice, making a timely decision is crucial for improving patient outcome. To achieve time efficiency, the size (specifically, the number of weight parameters) of ML models needs to be kept small. However, reducing the model size, which accordingly reduces the capacity and expressivity of the model, typically sacrifices modeling power and performance. It is technically appealing but challenging to compress model size without losing performance.

5) How to efficiently learn large-scale models: In certain healthcare applications, both the model size and data size are large, incurring substantial computation overhead that exceeds the capacity of a single machine. It is necessary to design and build distributed systems to efficiently train such models.

The ICD is a healthcare classification system maintained by the World Health Organization. It provides a hierarchy of diagnostic codes of diseases, disorders, injuries, signs, symptoms, etc. It is widely used for reporting diseases and health conditions, assisting in medical reimbursement decisions, collecting morbidity and mortality statistics, to name a few.

While ICD codes are important for making clinical and financial decisions, medical coding, which assigns proper ICD codes to a patient visit, is time-consuming, error-prone and expensive. Medical coders review the diagnosis descriptions written by physicians in the form of textual phrases and sentences and (if necessary) other information in the electronic medical record of a clinical episode, then manually attribute the appropriate ICD codes by following the coding guidelines. Several types of errors frequently occur. First, the ICD codes are organized in a hierarchical structure. For a node representing a disease C, the children of this node represent the subtypes of C. In many cases, the difference between disease subtypes is very subtle. It is common that human coders select incorrect subtypes. Second, when writing diagnosis descriptions, physicians often utilize abbreviations and synonyms, which causes ambiguity and imprecision when the coders are matching ICD codes to those descriptions. Third, in many cases, several diagnosis descriptions are closely related and should be mapped to a single ICD code. However, inexperienced coders may code each disease separately. Such errors are called unbundling. The cost incurred by coding errors and the financial investment spent on improving coding quality are estimated to be $25 billion per year in the United States.

To reduce coding errors and cost, it is desirable to build an ICD coding model which automatically and accurately translates the free-text diagnosis descriptions into ICD codes. To achieve this goal, several technical challenges need to be addressed.

First, there exists a hierarchical structure among the ICD codes. This hierarchy can be leveraged to improve coding accuracy. On one hand, if codes A and B are both children of C, then it is unlikely that A and B are simultaneously assigned to a patient. On the other hand, if the distance between A and B in the code tree is smaller than that between A and C and we know A is the correct code, then B is more likely to be a correct code than C, since codes with smaller distance are more clinically relevant. How to exploit this hierarchical structure for better coding is technically demanding.

Second, the diagnosis descriptions and the textual descriptions of ICD codes are written in quite different styles even if they refer to the same disease. In particular, the textual description of an ICD code is formally and precisely worded, while diagnosis descriptions are usually written by physicians in an informal and ungrammatical way, with telegraphic phrases, abbreviations, and typos.

Third, it is required that the assigned ICD codes are ranked according to their relevance to the patient. How to correctly determine this order is technically nontrivial.

Fourth, as stated earlier, there does not necessarily exist a one-to-one mapping between diagnosis descriptions and ICD codes, and human coders should consider the overall health condition when assigning codes. In many cases, two closely related diagnosis descriptions need to be mapped onto a single combination ICD code. On the other hand, physicians may write two health conditions into one diagnosis description which should be mapped onto two ICD codes under such circumstances.

SUMMARY

In one aspect of the disclosure, a method of assigning a set of international classification of diseases (ICD) codes to a patient includes obtaining a diagnostic description vector from at least one diagnostic description record of the patient; and applying a machine-learned ICD code assignment algorithm to the diagnostic description vector to assign a set of ICD codes to the patient.

In another aspect of the disclosure, a system for assigning a set of ICD codes to a patient includes a diagnostic description encoding module and an ICD code assignment module. The diagnostic description encoding module is configured to obtain a diagnostic description vector from at least one diagnostic description record of the patient. The ICD code assignment module is configured to apply a machine-learned ICD code assignment algorithm to the diagnostic description vector to assign a set of ICD codes to the patient.

In another aspect of the disclosure, a machine learning apparatus for generating a map between diagnostic descriptions and ICD codes includes a processor and a memory coupled to the processor. The processor is configured to generate representations of diagnostic descriptions in a form of diagnostic description vectors, and to generate representations of ICD codes in a form of ICD vectors. The processor is further configured to process the diagnostic description vectors and the ICD vectors to obtain an importance score between each diagnostic description represented in a diagnostic description vector and each ICD represented in an ICD vector, and to associate each diagnostic description represented in the diagnostic description vector with one or more ICDs represented in the ICD vector based on the importance scores.

It is understood that other aspects of methods and systems will become readily apparent to those skilled in the art from the following detailed description, wherein various aspects are shown and described by way of illustration.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of apparatuses and methods will now be presented in the detailed description by way of example, and not by way of limitation, with reference to the accompanying drawings, wherein:

FIG. 1 is a block diagram of a system for generating an international classification of diseases (ICD) code for a patient using a machine-learned algorithm.

FIG. 2 is a block diagram of a model used to develop and train the machine-learned algorithm of FIG. 1.

FIG. 3 is a block diagram of the architectural layers and functionality of the model used to develop and train the machine-learned algorithm of FIG. 1.

FIG. 4 is a block diagram of a computing device that embodies the system of FIG. 1.

FIG. 5 is a block diagram of an apparatus that develops and trains the machine-learned algorithm of FIG. 1.

DETAILED DESCRIPTION

Disclosed herein is a system for automatically assigning an international classification of diseases (ICD) code for a patient using a machine-learned algorithm. The system comprises a neural architecture that automatically performs ICD coding based on an input corresponding to a patient's diagnosis descriptions. The diagnosis descriptions may be input in the form of a physician's free-form writing. The neural architecture has four aspects: First, the architecture uses a tree-of-sequences long short-term memory (LSTM) network to simultaneously capture the hierarchical relationship among ICD codes and the semantics of each ICD code. Second, the architecture utilizes an adversarial learning approach to reconcile the different writing styles of diagnosis descriptions and ICD code descriptions. Third, the architecture utilizes isotonic constraints to preserve the importance order among codes, and an algorithm based on the alternating direction method of multipliers (ADMM) to solve the constrained problem. Fourth, the architecture employs an attentional matching mechanism to perform many-to-one and one-to-many mappings between diagnosis descriptions and ICD codes. Some of the concepts and features described herein are included in Diversity-promoting and Large-scale Machine Learning for Healthcare, a thesis submitted by Pengtao Xie in August 2018 to the Machine Learning Department, School of Computer Science, Carnegie Mellon University, which is hereby incorporated by reference in its entirety.

With reference to FIG. 1, in one configuration, an ICD coding system 100 includes a diagnostic description encoding module 102 and an ICD code assignment module 104. The diagnostic description encoding module 102 is configured to receive a diagnostic description record 106 for a subject patient and to produce a representation of the record as an encoded diagnostic description (DD) vector 108. The ICD code assignment module 104 receives the diagnostic description as the diagnostic description (DD) vector 108 and applies a previously-trained machine-learned algorithm 110 to the vector. The machine-learned algorithm 110 determines relevant ICD codes corresponding to the diagnostic descriptions included in the diagnostic description vector 108 and outputs an assigned ICD code 112 for the patient.

The diagnostic description encoding module 102 is configured to extract information from the diagnostic description record 106, and derive the encoded diagnostic description (DD) vector 108 from the extracted information. The diagnostic description record 106 may be hand-written diagnostic notes identifying one or more diagnoses of the patient. In one embodiment, the diagnostic description encoding module 102 employs an LSTM recurrent neural network (RNN) to encode the diagnosis descriptions. An example of such a recurrent network is described in Martin Sundermeyer, Ralf Schluter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012, the disclosure of which is herein incorporated by reference. In another embodiment, the diagnostic description encoding module 102 employs a sequential long short-term memory (SLSTM) network to encode each description individually.

With reference to FIG. 2, the machine-learned algorithm 110 of the ICD code assignment module 104 of FIG. 1 is designed and trained using a model that consists of five modules: a diagnostic description encoding module 202 that encodes diagnostic descriptions, an ICD code description encoding module 204 that encodes ICD codes based on their textual descriptions, an adversarial reconciliation module 206, and an attentional matching module 208, with an embedded isotonic constraints module 210, that matches diagnosis descriptions with ICD codes and assigns the ICD codes. The functional architecture of the model of FIG. 2 is shown in FIG. 3. The diagnostic description encoding module 202 used to develop and train the machine-learned algorithm 110 is configured to encode diagnostic descriptions in the same way as the diagnostic description encoding module 102 of the ICD coding system 100.

Diagnostic Description Encoding Module

The diagnostic description encoding module 202 is configured to receive diagnostic descriptions and generate a latent representation of the diagnostic descriptions in the form of an encoded DD vector 216.

In one embodiment, the diagnostic description encoding module 202 employs an LSTM recurrent neural network (RNN) to encode the diagnosis descriptions. An example of such a recurrent network is described in Martin Sundermeyer, Ralf Schluter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012, the disclosure of which is herein incorporated by reference.

LSTM is a popular variant of the recurrent neural network. Due to its capacity for capturing long-range semantics in texts, LSTM is widely used for language modeling and sequence encoding. An LSTM recurrent network consists of a sequence of units, each of which models one item in the input sequence. For each diagnosis description, both a character-level LSTM network and a word-level LSTM network are used to obtain its hidden representation. The reason for using character-aware encoding is that many medical terms share the same suffix to denote similar diseases, and the character-level LSTM captures such characteristics. With reference to FIG. 3, the hidden representations of the written diagnosis descriptions are denoted as $h_1, \ldots, h_m$, where m is the number of diagnosis descriptions in one record.

In another embodiment, the diagnostic description encoding module 202 employs a sequential long short-term memory (SLSTM) network to encode each description individually. The weight parameters of this SLSTM are tied with those of the SLSTM adopted by the ICD code description encoding module 204 for encoding ICD code descriptions, as described below.

ICD Code Description Encoding Module

The ICD code description encoding module 204 is configured to receive ICD codes 212 and generate a latent representation of the codes in the form of an encoded ICD vector 218. Each ICD code has a description (a sequence of words) that tells the semantics of this code.

In one embodiment, the ICD code description encoding module 204 adopts the same two-level LSTM architecture, i.e., character-level and word-level, used for diagnostic description encoding, to obtain the hidden representation of its textual description. The parameters of the neural networks for the ICD code description encoding module 204 and the diagnostic description encoding module 202 are not tied, in order to learn different language styles of these two sets of texts. With reference to FIG. 3, the hidden representations of different ICD codes are denoted as $u_1, \ldots, u_n$, where n is the total number of codes.

In another configuration, the ICD code description encoding module 204 employs a sequential LSTM (SLSTM) to encode this description. To capture the hierarchical relationship among codes, a tree-of-sequences LSTM (TLSTM) is built along the code tree. The inputs of TLSTM include the code hierarchy and the hidden states of individual codes produced by the SLSTMs. It consists of a bottom-up TLSTM and a top-down TLSTM, which produce two hidden states h↑ and h↓ at each node in the tree. A more detailed description of a TLSTM is presented below.

In either configuration, the ICD code description encoding module 204 takes the textual descriptions of the ICD codes 212 and their hierarchical structure as inputs and produces a latent representation for each code and includes the representation in an ICD vector 218. The representation aims at simultaneously capturing the semantics of each code and the hierarchical relationship among codes. By incorporating the code hierarchy, the model can avoid selecting codes that are subtypes of the same disease and promote the selection of codes that are clinically correlated.

Tree-of-Sequences Long Short-Term Memory (LSTM) Network

As mentioned above, each ICD code has a description (a sequence of words) that tells the semantics of this code. The ICD code description encoding module 204 employs a sequential LSTM (SLSTM) to encode this description. To capture the hierarchical relationship among codes, a tree-of-sequences LSTM (TLSTM) is built along the code tree. The inputs of this TLSTM include the code hierarchy and the hidden states of individual codes produced by the SLSTMs. It consists of a bottom-up TLSTM and a top-down TLSTM, which produce two hidden states h↑ and h↓ at each node in the tree.

Sequential LSTM: An SLSTM network is a special type of recurrent neural network that (1) learns the latent representation (which usually reflects certain semantic information) of words and (2) models the sequential structure among words. In the word sequence, each word t is allocated an SLSTM unit, which consists of the following components: input gate $i_t$, forget gate $f_t$, output gate $o_t$, memory cell $c_t$, and hidden state $s_t$. These components (vectors) are computed as follows:

$i_t = \sigma\left(W^{(i)} s_{t-1} + U^{(i)} x_t + b^{(i)}\right)$

$f_t = \sigma\left(W^{(f)} s_{t-1} + U^{(f)} x_t + b^{(f)}\right)$

$o_t = \sigma\left(W^{(o)} s_{t-1} + U^{(o)} x_t + b^{(o)}\right)$

$c_t = i_t \odot \tanh\left(W^{(c)} s_{t-1} + U^{(c)} x_t + b^{(c)}\right) + f_t \odot c_{t-1}$

$s_t = o_t \odot \tanh(c_t), \quad \text{(Eq. 1)}$

where $x_t$ is the embedding vector of word t, W and U are component-specific weight matrices, and b are bias vectors.
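The recurrence of Eq. 1 may be sketched in code as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the dictionary layout of the parameters, the dimensions, and the random initialization are all assumptions made for the example, and each gate is assumed to have its own weight matrices and bias vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x_t, s_prev, c_prev, params):
    """One SLSTM unit update in the form of Eq. 1.

    x_t    : embedding vector of word t
    s_prev : hidden state s_{t-1}
    c_prev : memory cell c_{t-1}
    params : {"W": {...}, "U": {...}, "b": {...}} keyed by gate (hypothetical layout)
    """
    W, U, b = params["W"], params["U"], params["b"]
    i_t = sigmoid(W["i"] @ s_prev + U["i"] @ x_t + b["i"])  # input gate
    f_t = sigmoid(W["f"] @ s_prev + U["f"] @ x_t + b["f"])  # forget gate
    o_t = sigmoid(W["o"] @ s_prev + U["o"] @ x_t + b["o"])  # output gate
    candidate = np.tanh(W["c"] @ s_prev + U["c"] @ x_t + b["c"])
    c_t = i_t * candidate + f_t * c_prev                     # memory cell
    s_t = o_t * np.tanh(c_t)                                 # hidden state
    return s_t, c_t

def init_params(embed_dim, hidden_dim, rng):
    """Small random weights and zero biases (illustrative initialization only)."""
    gates = ("i", "f", "o", "c")
    return {
        "W": {g: 0.1 * rng.standard_normal((hidden_dim, hidden_dim)) for g in gates},
        "U": {g: 0.1 * rng.standard_normal((hidden_dim, embed_dim)) for g in gates},
        "b": {g: np.zeros(hidden_dim) for g in gates},
    }

def encode_sequence(embeddings, params, hidden_dim):
    """Run the SLSTM over a word sequence and return the final hidden state."""
    s = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in embeddings:
        s, c = slstm_step(x_t, s, c, params)
    return s

rng = np.random.default_rng(0)
params = init_params(embed_dim=4, hidden_dim=3, rng=rng)
words = rng.standard_normal((5, 4))  # five stand-in word embeddings
s_final = encode_sequence(words, params, hidden_dim=3)
```

In this sketch the final hidden state `s_final` stands in for the encoding of one description; how the per-word states are pooled into a description encoding is not specified by Eq. 1 and is an assumption here.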

Tree-of-sequences LSTM: A bi-directional tree LSTM (TLSTM) captures the hierarchical relationships among codes. The inputs of this TLSTM include the code hierarchy and the hidden states of individual codes produced by the SLSTMs. It consists of a bottom-up TLSTM and a top-down TLSTM, which produce two hidden states h↑ and h↓ at each node in the tree.

In the bottom-up TLSTM, an internal node (representing a code C, having M children) comprises the following components: an input gate $i_\uparrow$, an output gate $o_\uparrow$, a memory cell $c_\uparrow$, a hidden state $h_\uparrow$, and M child-specific forget gates $\{f_\uparrow^{(m)}\}_{m=1}^{M}$, where $f_\uparrow^{(m)}$ corresponds to the m-th child. The transition equations among components are:

$i_\uparrow = \sigma\left(\sum_{m=1}^{M} W_\uparrow^{(i,m)} h_\uparrow^{(m)} + U^{(i)} s + b_\uparrow^{(i)}\right)$

$\forall m,\; f_\uparrow^{(m)} = \sigma\left(W_\uparrow^{(f,m)} h_\uparrow^{(m)} + U^{(f,m)} s + b_\uparrow^{(f,m)}\right)$

$o_\uparrow = \sigma\left(\sum_{m=1}^{M} W_\uparrow^{(o,m)} h_\uparrow^{(m)} + U^{(o)} s + b_\uparrow^{(o)}\right)$

$u_\uparrow = \tanh\left(\sum_{m=1}^{M} W_\uparrow^{(u,m)} h_\uparrow^{(m)} + U^{(u)} s + b_\uparrow^{(u)}\right)$

$c_\uparrow = i_\uparrow \odot u_\uparrow + \sum_{m=1}^{M} f_\uparrow^{(m)} \odot c_\uparrow^{(m)}$

$h_\uparrow = o_\uparrow \odot \tanh(c_\uparrow), \quad \text{(Eq. 2)}$

where s is the SLSTM hidden state that encodes the name of the concept C. $\{h_\uparrow^{(m)}\}_{m=1}^{M}$ and $\{c_\uparrow^{(m)}\}_{m=1}^{M}$ are the bottom-up TLSTM hidden states and memory cells of the children.

W, U, b are component-specific weight matrices and bias vectors. For a leaf node having no children, its only input is the SLSTM hidden state s and no forget gates are needed. The transition equations are:

$i_\uparrow = \sigma\left(U^{(i)} s + b_\uparrow^{(i)}\right)$

$o_\uparrow = \sigma\left(U^{(o)} s + b_\uparrow^{(o)}\right)$

$u_\uparrow = \tanh\left(U^{(u)} s + b_\uparrow^{(u)}\right)$

$c_\uparrow = i_\uparrow \odot u_\uparrow$

$h_\uparrow = o_\uparrow \odot \tanh(c_\uparrow) \quad \text{(Eq. 3)}$
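The bottom-up transitions for internal and leaf nodes can be sketched together, since the leaf case is the internal case with an empty set of children. The following is a hedged illustration only: the parameter layout, the function names `make_params` and `bottom_up_node`, and the dimensions are hypothetical, and the forget gate for child m is assumed to depend on that child's hidden state and on the SLSTM state s of the current node.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_params(hidden_dim, s_dim, max_children, rng):
    """Random parameters in a hypothetical layout (names are illustrative)."""
    def mat(rows, cols):
        return 0.1 * rng.standard_normal((rows, cols))
    gates = ("i", "o", "u")
    return {
        # W[gate][m] multiplies the hidden state of the m-th child
        "W": {g: [mat(hidden_dim, hidden_dim) for _ in range(max_children)] for g in gates},
        "U": {g: mat(hidden_dim, s_dim) for g in gates},
        "b": {g: np.zeros(hidden_dim) for g in gates},
        # child-specific forget-gate parameters
        "Wf": [mat(hidden_dim, hidden_dim) for _ in range(max_children)],
        "Uf": [mat(hidden_dim, s_dim) for _ in range(max_children)],
        "bf": [np.zeros(hidden_dim) for _ in range(max_children)],
    }

def bottom_up_node(s, children, p):
    """Return (h_up, c_up) for one node of the code tree.

    s        : SLSTM hidden state encoding this code's description
    children : list of (h_up, c_up) pairs of the children; empty for a leaf
    """
    def pre_activation(gate):
        # U s + b, plus the sum over children of W^(gate, m) h^(m)
        z = p["U"][gate] @ s + p["b"][gate]
        for m, (h_m, _) in enumerate(children):
            z = z + p["W"][gate][m] @ h_m
        return z

    i = sigmoid(pre_activation("i"))
    o = sigmoid(pre_activation("o"))
    u = np.tanh(pre_activation("u"))
    c = i * u
    for m, (h_m, c_m) in enumerate(children):
        f_m = sigmoid(p["Wf"][m] @ h_m + p["Uf"][m] @ s + p["bf"][m])
        c = c + f_m * c_m  # each child's memory is gated separately
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
p = make_params(hidden_dim=3, s_dim=4, max_children=2, rng=rng)
s_leaf1, s_leaf2, s_parent = rng.standard_normal((3, 4))
leaf1 = bottom_up_node(s_leaf1, [], p)  # leaf case: no children
leaf2 = bottom_up_node(s_leaf2, [], p)
h_parent, c_parent = bottom_up_node(s_parent, [leaf1, leaf2], p)
```

Processing the tree in post-order (children before parents) ensures that every child state is available when its parent is evaluated.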

In the top-down TLSTM, a non-root node has the following components: an input gate $i_\downarrow$, a forget gate $f_\downarrow$, an output gate $o_\downarrow$, a memory cell $c_\downarrow$, and a hidden state $h_\downarrow$. The transition equations are:

$i_\downarrow = \sigma\left(W_\downarrow^{(i)} h_\downarrow^{(p)} + b_\downarrow^{(i)}\right)$

$f_\downarrow = \sigma\left(W_\downarrow^{(f)} h_\downarrow^{(p)} + b_\downarrow^{(f)}\right)$

$o_\downarrow = \sigma\left(W_\downarrow^{(o)} h_\downarrow^{(p)} + b_\downarrow^{(o)}\right)$

$u_\downarrow = \tanh\left(W_\downarrow^{(u)} h_\downarrow^{(p)} + b_\downarrow^{(u)}\right)$

$c_\downarrow = i_\downarrow \odot u_\downarrow + f_\downarrow \odot c_\downarrow^{(p)}$

$h_\downarrow = o_\downarrow \odot \tanh(c_\downarrow), \quad \text{(Eq. 4)}$

where $h_\downarrow^{(p)}$ and $c_\downarrow^{(p)}$ are the top-down TLSTM hidden state and memory cell of the parent of this node. For the root node, which has no parent, $h_\downarrow$ cannot be computed using the above equations. Instead, we set $h_\downarrow$ to $h_\uparrow$ (the bottom-up TLSTM hidden state generated at the root node). $h_\uparrow$ captures the semantics of all codes, which is then propagated downwards to each individual code via the top-down TLSTM dynamics.

The hidden states of the two directions are concatenated to obtain the bidirectional TLSTM encoding of each concept h=[h_(↑); h_(↓)]. The bottom-up TLSTM composes the semantics of children (representing sub-codes) and merges them into the current node, which hence captures child-to-parent relationship. The top-down TLSTM makes each node inherit the semantics of its parent, which captures parent-to-child relation. As a result, the hierarchical relationship among codes is encoded in the hidden states.
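The top-down pass and the concatenation of the two directions may be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the bottom-up states are taken as given inputs (random stand-ins here), the code labels and parameter names are hypothetical, and the root's top-down memory cell, which the description above does not specify, is assumed to be initialized from the root's bottom-up memory cell.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_down_pass(tree, root, h_up, c_up, p):
    """Compute top-down hidden states for every node.

    tree : dict mapping a node to the list of its children
    h_up : dict of bottom-up hidden states, keyed by node
    c_up : dict of bottom-up memory cells, keyed by node
    """
    h_down = {root: h_up[root]}  # the root reuses its bottom-up hidden state
    c_down = {root: c_up[root]}  # assumption: not specified in the description
    stack = [root]
    while stack:
        parent = stack.pop()
        hp, cp = h_down[parent], c_down[parent]
        for child in tree.get(parent, []):
            i = sigmoid(p["W_i"] @ hp + p["b_i"])
            f = sigmoid(p["W_f"] @ hp + p["b_f"])
            o = sigmoid(p["W_o"] @ hp + p["b_o"])
            u = np.tanh(p["W_u"] @ hp + p["b_u"])
            c_down[child] = i * u + f * cp
            h_down[child] = o * np.tanh(c_down[child])
            stack.append(child)
    return h_down

def bidirectional_encoding(h_up, h_down):
    """Concatenate the two directions: h = [h_up; h_down] for each code."""
    return {code: np.concatenate([h_up[code], h_down[code]]) for code in h_up}

rng = np.random.default_rng(0)
dim = 3
# Hypothetical code tree; labels are placeholders, not real ICD codes.
tree = {"root": ["A", "B"], "A": ["A.1", "A.2"]}
codes = ["root", "A", "B", "A.1", "A.2"]
h_up = {code: np.tanh(rng.standard_normal(dim)) for code in codes}
c_up = {code: rng.standard_normal(dim) for code in codes}
p = {}
for gate in ("i", "f", "o", "u"):
    p["W_" + gate] = 0.1 * rng.standard_normal((dim, dim))
    p["b_" + gate] = np.zeros(dim)

h_down = top_down_pass(tree, "root", h_up, c_up, p)
h = bidirectional_encoding(h_up, h_down)
```

Because the top-down transition in this sketch takes only the parent's state as input, sibling nodes receive the same top-down state; the bottom-up half of the concatenated encoding is what distinguishes them.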

To address the issue of overfitting described above in the background section, diversity-promoting regularization may be leveraged to improve the performance of the ICD code description encoding module 204. Diversity-promoting regularization imposes a structural constraint on parameters of the ICD code description encoding module 204, which reduces the model capacity and therefore improves generalization performance on unseen data.

In accordance with embodiments disclosed herein, overfitting may be alleviated using a diversity-promoting regularization in the form of 1) a uniform eigenvalue regularizer (UER) applied to an LSTM network, or 2) angular constraints.

Uniform Eigenvalue Regularizer

In some embodiments, uncorrelation among components may be characterized from a statistical perspective by treating components as random variables and measuring their covariance, which is proportional to their correlation. In one embodiment, $A \in \mathbb{R}^{d \times m}$ denotes the component matrix whose k-th column is the parameter vector $a_k$ of component k. In some embodiments, a row view of A may be used, where each component is treated as a random variable and each row vector $\tilde{a}_i^T$ is a sample drawn from the random vector formed by the m components. Further,

$\mu = \frac{1}{d}\sum_{i=1}^{d} \tilde{a}_i = \frac{1}{d} A^T 1$

may be set as the sample mean, where the elements of $1 \in \mathbb{R}^{d}$ are all 1. An empirical covariance matrix may then be computed with the components as:

$\begin{matrix} {G = {{\frac{1}{d}{\sum\limits_{i = 1}^{d}{\left( {{\overset{\sim}{a}}_{i} - \mu} \right)\left( {{\overset{\sim}{a}}_{i} - \mu} \right)^{T}}}} = {{\frac{1}{d}A^{T}A} - {\left( {\frac{1}{d}A^{T}1} \right){\left( {\frac{1}{d}A^{T}1} \right)^{T}.}}}}} & \left( {{Eq}.\mspace{14mu} 5} \right) \end{matrix}$

By imposing the constraint $A^T 1 = 0$, the covariance matrix reduces to

$G = \frac{1}{d} A^T A.$

Suppose A is a full-rank matrix and m<d; then G, which is an m×m matrix, is a full-rank matrix with rank m.
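The covariance computation of Eq. 5 and its simplification under the zero-mean constraint can be checked numerically. The sketch below is illustrative only: the function name `component_covariance`, the random component matrix, and the sizes d=6 and m=3 are assumptions chosen for the example.

```python
import numpy as np

def component_covariance(A):
    """Empirical covariance of the m components (columns of A), per Eq. 5.

    Each of the d rows of A is treated as one sample of the m-dimensional
    random vector formed by the components.
    """
    d = A.shape[0]
    mu = A.T @ np.ones(d) / d        # sample mean, (1/d) A^T 1
    centered = A - mu                # subtract the mean from every row
    G = centered.T @ centered / d
    return G, mu

rng = np.random.default_rng(0)
d, m = 6, 3
A = rng.standard_normal((d, m))

G, mu = component_covariance(A)
# Closed form on the right-hand side of Eq. 5
G_closed = A.T @ A / d - np.outer(mu, mu)

# Enforce the constraint A^T 1 = 0 by removing the column means
A0 = A - A.mean(axis=0)
G0 = A0.T @ A0 / d                   # G reduces to (1/d) A^T A
rank_G0 = np.linalg.matrix_rank(G0)
```

Here `G` and `G_closed` agree up to floating-point error, and `G0` is the reduced form obtained once the columns of the component matrix sum to zero.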

For the next step, the eigenvalues of G play important roles in characterizing the uncorrelation and evenness of components. Let G=Σ_(k=1) ^(m)λ_(k)u_(k)u_(k) ^(T) be the eigendecomposition where λ_(k) is an eigenvalue and u_(k) is the associated eigenvector. In principal component analysis, an eigenvector u_(k) of the covariance matrix G represents a principal direction of the data points and the associated eigenvalue λ_(k) tells the variability of points along that direction. The larger λ_(k) is, the more spread out the points are along the direction u_(k). When the eigenvectors (principal directions) are not aligned with the coordinate axes, the level of disparity among eigenvalues indicates the level of correlation among the m components (random variables). The more different the eigenvalues are, the higher the correlation is. Considering this, the uniformity among eigenvalues of G can be utilized to measure how uncorrelated the components are.
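As a small worked illustration of this relationship (the two-component setting, the unit variances, and the symbol ρ are assumptions introduced here for illustration only), consider m=2 components with unit variance and correlation ρ. The covariance matrix and its eigendecomposition are then:

$G = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \quad \lambda_1 = 1 + \rho, \quad \lambda_2 = 1 - \rho, \quad u_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad u_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$

For ρ≠0 the eigenvectors are not aligned with the coordinate axes, and the gap between the eigenvalues, $|\lambda_1 - \lambda_2| = 2|\rho|$, grows with the magnitude of the correlation. The eigenvalues are equal exactly when ρ=0, that is, when the two components are uncorrelated.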

Secondly, the eigenvalues are related to the other factor of diversity: evenness. When the eigenvectors are aligned with the coordinate axes, the components are uncorrelated. In this case, evenness is used to measure diversity. In this example, each component is assigned an importance score. Since the eigenvectors are parallel to the coordinate axes, the eigenvalues reflect the variance of components. Analogous to principal component analysis, which posits that random variables with larger variance are more important, the present embodiment may use variance to measure importance. According to the evenness criterion, the components are more diverse if their importance scores match, which motivates encouraging the eigenvalues to be uniform.

To sum up, the eigenvalues are encouraged to be even in both cases: (1) when the eigenvectors are not aligned with the coordinate axes, they are preferred to be even to reduce the correlation of components; (2) when the eigenvectors are aligned with the coordinate axes, they are encouraged to be even such that different components contribute equally in modeling data.

In some embodiments, to promote uniformity among eigenvalues, as a general approach, eigenvalues may be normalized into a probability simplex and then the discrete distribution parameterized by the normalized eigenvalues may be encouraged to have small Kullback-Leibler (KL) divergence with the uniform distribution. Given the eigenvalues {λ_(k)}_(k=1) ^(m), they are then normalized into a probability simplex

${\hat{\lambda}}_{k} = \frac{\lambda_{k}}{\sum_{j = 1}^{m}\lambda_{j}}$

based on which a distribution is defined on a discrete random variable X=1, . . . , m where p(X=k)=λ̂_(k).

In addition, to ensure the eigenvalues are strictly positive, A^(T)A may be set to be positive definite. To encourage {λ_(k)}_(k=1) ^(m) to be uniform, the distribution p(X) is encouraged to be “close” to a uniform distribution

${{q\left( {X = k} \right)} = \frac{1}{m}},$

where the “closeness” is measured using KL divergence

$\begin{matrix} {{{KL}\left( {p\,\|\, q} \right)} = {\sum_{k = 1}^{m}{{\hat{\lambda}}_{k}\log \frac{{\hat{\lambda}}_{k}}{1/m}}} = {\frac{\sum_{k = 1}^{m}{\lambda_{k}\log \lambda_{k}}}{\sum_{j = 1}^{m}\lambda_{j}} - {\log {\sum_{j = 1}^{m}\lambda_{j}}} + {\log m.}}} & \left( {{Eq}.\; 6} \right) \end{matrix}$

In this equation, Σ_(k=1) ^(m) λ_(k) log λ_(k) is equivalent to

${{tr}\left( {\left( {\frac{1}{d}A^{T}A} \right){\log \left( {\frac{1}{d}A^{T}A} \right)}} \right)},$

where log(⋅) denotes matrix logarithm. To show this, note that

${{\log \left( {\frac{1}{d}A^{T}A} \right)} = {\sum\limits_{k = 1}^{m}{{\log \left( \lambda_{k} \right)}u_{k}u_{k}^{T}}}},$

according to the property of matrix logarithm. Then,

${tr}\left( {\left( {\frac{1}{d}A^{T}A} \right){\log \left( {\frac{1}{d}A^{T}A} \right)}} \right)$

is equal to tr((Σ_(k=1) ^(m)λ_(k)u_(k)u_(k) ^(T))(Σ_(k=1) ^(m) log(λ_(k))u_(k)u_(k) ^(T))), which equals Σ_(k=1) ^(m) λ_(k) log λ_(k) because the eigenvectors u_(k) are orthonormal. According to the property of trace,

${{tr}\left( {\frac{1}{d}A^{T}A} \right)} = {\sum\limits_{k = 1}^{m}{\lambda_{k}.}}$

Then the KL divergence can be turned into a diversity-promoting uniform eigenvalue regularizer (UER):

$\begin{matrix} {{\frac{{tr}\left( {\left( {\frac{1}{d}A^{T}A} \right){\log \left( {\frac{1}{d}A^{T}A} \right)}} \right)}{{tr}\left( {\frac{1}{d}A^{T}A} \right)} - {\log \; {{tr}\left( {\frac{1}{d}A^{T}A} \right)}}},} & \left( {{Eq}.\mspace{14mu} 7} \right) \end{matrix}$

subject to A^(T)A≻0 and A^(T)1=0.

UER then may be applied to promote diversity. For example, let ℒ(A) denote the objective function of an ML model, then a UE-regularized ML problem can be defined as

${\min_{A}{\mathcal{L}(A)}} + {\lambda \left( {\frac{{tr}\left( {\left( {\frac{1}{d}A^{T}A} \right){\log \left( {\frac{1}{d}A^{T}A} \right)}} \right)}{{tr}\left( {\frac{1}{d}A^{T}A} \right)} - {\log \; {{tr}\left( {\frac{1}{d}A^{T}A} \right)}}} \right)}$

subject to A^(T)A≻0 and A^(T)1=0, where λ is the regularization parameter. In principle, the “closeness” between p and q can be measured by other distances such as the total variation distance, Hellinger distance, etc. However, the resultant formula defined on the eigenvalues (like the one in Eq. 6) is very difficult, if at all possible, to transform into a formula defined on A (like the one in Eq. 7). Consequently, it is very challenging to perform estimation of A. In light of this, we choose to use the KL divergence.
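As a non-limiting sketch, the equivalence between the eigenvalue form (Eq. 6) and the trace form (Eq. 7) of the regularizer may be checked numerically as shown below. The sketch assumes NumPy and computes the matrix logarithm through an eigendecomposition, which is valid for a symmetric positive definite matrix. The function names `uer_from_eigenvalues` and `uer_from_trace` and the test matrices are hypothetical. Eq. 7 omits the constant log m, so the two values differ by exactly that constant.

```python
import numpy as np

def uer_from_eigenvalues(A):
    """KL divergence between the normalized eigenvalues of G and the uniform distribution (Eq. 6)."""
    d, m = A.shape
    lam = np.linalg.eigvalsh(A.T @ A / d)      # eigenvalues of G = (1/d) A^T A
    lam_hat = lam / lam.sum()                  # normalization onto the probability simplex
    return float(np.sum(lam_hat * np.log(lam_hat * m)))

def uer_from_trace(A):
    """Trace form of the regularizer (Eq. 7), which omits the constant log m."""
    d, m = A.shape
    G = A.T @ A / d
    lam, U = np.linalg.eigh(G)
    log_G = (U * np.log(lam)) @ U.T            # matrix logarithm: sum_k log(lam_k) u_k u_k^T
    return float(np.trace(G @ log_G) / np.trace(G) - np.log(np.trace(G)))

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
A = A - A.mean(axis=0)                         # enforces A^T 1 = 0
Q, _ = np.linalg.qr(A)                         # orthonormal columns, so eigenvalues of G are equal
print(uer_from_eigenvalues(A), uer_from_eigenvalues(Q))
```

In this example the matrix Q has perfectly uniform eigenvalues, so its regularizer value is zero up to rounding, while the random matrix A receives a positive penalty.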

Compared with previous diversity-promoting regularizers, UER has the following benefits: (1) It measures the diversity of all components in a holistic way, rather than reducing to pairwise dissimilarities as other regularizers do. This enables UER to capture global relations among components. (2) Unlike determinant-based regularizers that are sensitive to vector scaling, UER is derived from normalized eigenvalues where the normalization effectively removes scaling. (3) UER is amenable to computation. First, unlike the decorrelation regularizer that is defined over data-dependent intermediate variables and thus incurs computational inefficiency, UER is directly defined on model parameters and is independent of data. Second, unlike the regularizers that are non-smooth, UER is a smooth function. In general, smooth functions are more amenable to deriving optimization algorithms than non-smooth functions. The dominating computation in UER is the matrix logarithm. It does not substantially increase computational overhead as long as the number of components is not too large (e.g., less than 1000).

Uniform Eigenvalue Regularized LSTM

The LSTM network is a type of recurrent neural network that is better at capturing long-term dependency in sequential modeling. At each time step t where the input is x_(t), there is an input gate i_(t), a forget gate f_(t), an output gate o_(t), a memory cell c_(t), and a hidden state h_(t). The transition equations among them are:

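$\begin{matrix} {i_{t} = {\sigma\left( {{W^{(i)}x_{t}} + {U^{(i)}h_{t - 1}} + b^{(i)}} \right)}} & \; \\ {f_{t} = {\sigma\left( {{W^{(f)}x_{t}} + {U^{(f)}h_{t - 1}} + b^{(f)}} \right)}} & \; \\ {o_{t} = {\sigma\left( {{W^{(o)}x_{t}} + {U^{(o)}h_{t - 1}} + b^{(o)}} \right)}} & \; \\ {c_{t} = {{i_{t} \odot {\tanh\left( {{W^{(c)}x_{t}} + {U^{(c)}h_{t - 1}} + b^{(c)}} \right)}} + {f_{t} \odot c_{t - 1}}}} & \; \\ {{h_{t} = {o_{t} \odot {\tanh\left( c_{t} \right)}}},} & \left( {{Eq}.\mspace{14mu} 8} \right) \end{matrix}$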

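where 𝒲={W^((s))|s∈S={i, f, o, c}} and 𝒰={U^((s))|s∈S} are gate-specific weight matrices and ℬ={b^((s))|s∈S} are bias vectors. The row vectors in W and U are treated as components. Let ℒ(𝒲, 𝒰, ℬ) denote the loss function of an LSTM network and ℛ(⋅) denote the UER (including constraints), then a UE-regularized LSTM problem can be defined as:

$\begin{matrix} {{\mathcal{L}\left( {\mathcal{W},\mathcal{U},\mathcal{B}} \right)} + {\lambda{\sum_{s \in S}\left( {{\mathcal{R}\left( W^{(s)} \right)} + {\mathcal{R}\left( U^{(s)} \right)}} \right)}}} & \left( {{Eq}.\mspace{14mu} 9} \right) \end{matrix}$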

The LSTM network is applied for cloze-style reading comprehension (CSRC).
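For illustration only, one time step of the LSTM transition equations (Eq. 8) may be sketched in Python as follows. The sketch assumes NumPy; the function name `lstm_step`, the dictionary layout keyed by gate name, and the dimensions are hypothetical. The regularization term of Eq. 9 is not included; in a regularized embodiment it would be added to the training loss for each gate-specific matrix.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One transition of Eq. 8. W, U, b are dicts keyed by gate name in {'i', 'f', 'o', 'c'}."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_t = i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"]) + f_t * c_prev
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 2                                           # illustrative sizes
gates = "ifoc"
W = {s: rng.standard_normal((n_hid, n_in)) for s in gates}
U = {s: rng.standard_normal((n_hid, n_hid)) for s in gates}
b = {s: np.zeros(n_hid) for s in gates}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):                   # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```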

Angular Constraints

Near-orthogonality may be used to represent “diversity”, using a regularization approach, angular constraints (ACs), where the angle between components is constrained to be close to π/2, which encourages the components to be close to orthogonal. Analysis shows that the closer to π/2 the angles are, the smaller the estimation error is and the larger the approximation error is. The best tradeoff between these two errors can be explored by properly tuning the angles. An algorithm based on the alternating direction method of multipliers (ADMM) solves the angle-constrained problems.

Angular constraints (ACs) use near-orthogonality to characterize “diversity” and encourage the angles between component vectors to be close to π/2. The ACs are defined as requiring the absolute value of the cosine similarity between each pair of components to be less than or equal to a small value τ, which leads to the following angle-constrained problem:

$\begin{matrix} {\min\limits_{\mathcal{W}}\;{\mathcal{L}\left( \mathcal{W} \right)}} & \left( {{Eq}.\mspace{14mu} 10} \right) \\ {{{s.t.\mspace{14mu} 1} \leq i < j \leq m},{\frac{\left| {w_{i} \cdot w_{j}} \right|}{{\| w_{i}\|}_{2}{\| w_{j}\|}_{2}} \leq \tau},} & \; \end{matrix}$

where 𝒲={w_(i)}_(i=1) ^(m) denotes the component vectors and ℒ(𝒲) is the objective function of this problem. The parameter τ controls the level of near-orthogonality (or diversity). A smaller τ indicates that the vectors are closer to being orthogonal, and hence are more diverse. As will be shown later, representing diversity using the angular constraints facilitates theoretical analysis and is empirically effective as well.
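As a minimal sketch, the feasibility test implied by the angular constraints of Eq. 10 may be written as below. The sketch assumes NumPy, stores the component vectors w_(i) as rows of a matrix, and uses the hypothetical function names `max_abs_cosine` and `satisfies_angular_constraints`. It only checks the constraints; it does not implement the ADMM-based solver.

```python
import numpy as np

def max_abs_cosine(W):
    """Largest |cosine similarity| between any two distinct rows of W."""
    unit = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalize each component vector
    cos = np.abs(unit @ unit.T)
    upper = np.triu_indices(W.shape[0], k=1)              # index pairs with i < j
    return float(cos[upper].max())

def satisfies_angular_constraints(W, tau):
    """True if every pair of components meets the constraint of Eq. 10."""
    return max_abs_cosine(W) <= tau

W_example = np.array([[1.0, 0.0],
                      [1.0, 1.0]])                         # the two rows form a 45-degree angle
print(satisfies_angular_constraints(W_example, tau=0.5))
print(satisfies_angular_constraints(W_example, tau=0.8))
```

The two rows in the example have cosine similarity 1/√2, so the constraint holds only for the larger of the two thresholds.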

Adversarial Reconciliation Module

The writing styles of diagnostic descriptions (DDs) and code descriptions (CDs) are largely different, which makes the matching between a DD and a CD error-prone. To address this issue, an adversarial learning approach reconciles the writing styles. On top of the latent representation DD vectors 216, a discriminative network is built to distinguish which inputs are DDs and which are CDs. The diagnostic description encoding module 202 and the ICD code description encoding module 204 try to make such a discrimination impossible. By doing this, the learned representations are independent of the writing styles and facilitate more accurate matching.

To this end, an adversarial learning approach is used to reconcile the different writing styles of diagnosis descriptions and code descriptions. The basic idea is: after encoding, if a description cannot be discerned to be a DD or a CD, then the difference in their writing styles is eliminated. A discriminative network included in the adversarial reconciliation module 206 takes the encoding vector of a description as input and tries to identify it as a DD or CD. The diagnostic description encoding module 202 and the ICD code description encoding module 204 adjust their weight parameters so that such a discrimination is difficult for the discriminative network to achieve.

Consider all the descriptions {t_(r), y_(r)}_(r=1) ^(R) where t_(r) is a description and y_(r) is a binary label: y_(r)=1 if t_(r) is a DD and y_(r)=0 otherwise. Let f(t_(r); W_(s)) denote the sequential LSTM (SLSTM) encoder parameterized by W_(s). This SLSTM encoder is shared by the diagnostic description encoding module 202 and the ICD code description encoding module 204. Note that for CDs, a TLSTM is further applied on top of the encodings produced by the SLSTM. The SLSTM encoding vectors of CDs are used as the input of the discriminative network rather than the TLSTM encodings, since the latter are irrelevant to writing styles. Let g(f(t_(r); W_(s)); W_(d)) denote the discriminative network parameterized by W_(d). It takes the encoding vector f(t_(r); W_(s)) as input and produces the probability that t_(r) is a DD. Adversarial learning is performed by solving this problem:

$\begin{matrix} {{\max\limits_{W_{s}}\; {\min\limits_{W_{d}}\; \mathcal{L}_{adv}}} = {\sum\limits_{r = 1}^{R}{{CE}\left( {{g\left( {{f\left( {t_{r};W_{s}} \right)};W_{d}} \right)},y_{r}} \right)}}} & \left( {{Eq}.\mspace{14mu} 11} \right) \end{matrix}$

The discriminative network tries to differentiate DDs from CDs by minimizing this classification loss while the encoder maximizes this loss so that DDs and CDs are not distinguishable.
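The role of the adversarial loss of Eq. 11 may be illustrated with the following sketch. It assumes NumPy and substitutes a simple logistic discriminator for the discriminative network, which is an assumption made only for the example; the names `discriminator` and `adversarial_loss` and the toy encodings are hypothetical. The sketch evaluates the loss for fixed encodings and does not perform the max-min optimization.

```python
import numpy as np

def discriminator(Z, w_d, b_d):
    """Hypothetical logistic discriminator: probability that each encoding in Z is a DD."""
    return 1.0 / (1.0 + np.exp(-(Z @ w_d + b_d)))

def adversarial_loss(Z, y, w_d, b_d):
    """Summed binary cross-entropy of Eq. 11 over encodings Z (R x k) and labels y (R,)."""
    p = np.clip(discriminator(Z, w_d, b_d), 1e-12, 1.0 - 1e-12)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

y = np.array([1.0, 1.0, 0.0, 0.0])                 # two DDs followed by two CDs
w_d, b_d = np.array([5.0, 0.0]), 0.0

# Encodings that still carry writing style: DDs and CDs are linearly separable.
Z_styled = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]])
# Encodings with style removed: DDs and CDs coincide, so the discriminator can only guess.
Z_reconciled = np.zeros((4, 2))

loss_styled = adversarial_loss(Z_styled, y, w_d, b_d)
loss_reconciled = adversarial_loss(Z_reconciled, y, w_d, b_d)
print(loss_styled, loss_reconciled)
```

When the encodings are indistinguishable, the discriminator outputs 0.5 for every input and the loss equals R·log 2, which is the value the encoder drives toward by maximizing the loss.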

Attentional Matching Module

The attentional matching module 208 is configured to map diagnostic descriptions to ICD codes. The encoded DD vectors 216 and the encoded ICD vectors 218 are fed into the attentional matching module 208 to perform code assignments. The attentional matching module 208 allows multiple diagnostic descriptions to be matched to a single code and allows a single diagnostic description to be matched to multiple codes. An order of importance among codes is incorporated by the isotonic constraints module 210. These constraints regulate the weight parameters of the model so that codes with higher importance are given larger prediction scores.

Typically, the number of written diagnosis descriptions does not equal the number of assigned ICD codes. Accordingly, the attentional matching module 208 disclosed herein is configured to take all diagnosis descriptions into account during coding by adopting an attention strategy. The attentional matching module 208 provides a recipe for choosing which diagnosis descriptions are important when performing coding. For the i-th ICD code, an importance score or attention score a_(ij) on the j-th diagnosis description is calculated as a_(ij)=u_(i) ^(T)h_(j). The attentional matching module 208 may utilize these attention scores based on a hard selection mechanism or a soft attention mechanism.

The hard selection mechanism is based on the assumption that the most related diagnosis description plays a decisive role when assigning ICD codes. In this mechanism, for each ICD code, the dominating diagnosis is defined as the one that has the maximum attention score among all diagnosis descriptions. The probability of the i-th ICD code being assigned is thus: p_(i)=sigmoid(max_(j=1, . . . ,m)a_(ij)).
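A minimal sketch of the hard selection mechanism is given below. It assumes NumPy, stacks the code vectors u_(i) and the diagnosis description vectors h_(j) as rows of two matrices, and uses the hypothetical function name `hard_selection_probabilities`; the numeric values are illustrative.

```python
import numpy as np

def hard_selection_probabilities(U, H):
    """U: (N, k) code vectors u_i; H: (M, k) diagnosis description vectors h_j."""
    scores = U @ H.T                          # attention scores a_ij = u_i^T h_j
    dominating = scores.argmax(axis=1)        # dominating diagnosis description per code
    p = 1.0 / (1.0 + np.exp(-scores.max(axis=1)))
    return p, dominating

U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
H = np.array([[2.0, 0.0],
              [0.0, -1.0],
              [1.0, 3.0]])
p, dominating = hard_selection_probabilities(U, H)
print(p, dominating)
```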

A soft-attention mechanism may be used to calculate an attention score or importance score between a diagnostic description and a plurality of ICD codes. An example of such a mechanism is described in Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2014, the disclosure of which is herein incorporated by reference.

Instead of choosing the single maximum attention score, as is done in the hard selection mechanism, the soft-attention mechanism applies a softmax function to normalize the attention scores among all diagnosis descriptions into a probability simplex. The normalized attention scores are utilized as the weights of different diagnosis descriptions. The weighted average over the hidden representations of different diagnosis descriptions is used as the attentional hidden vector. In this way, the attentional hidden vector can take into account all diagnosis descriptions with varying levels of attention.

In the soft-attention mechanism, the hidden representations of diagnostic descriptions and codes are denoted as {h_(m)}_(m=1) ^(M) and {u_(n)}_(n=1) ^(N) respectively, where M is the number of diagnostic descriptions of one patient and N is the total number of codes in the dataset. The mapping from diagnostic descriptions to codes is not one-to-one. In many cases, a code is assigned only when a certain combination of K (1<K≤M) diseases simultaneously appears within the M diagnostic descriptions, and the value of K depends on this code. The K diseases differ in their importance for determining the assignment of this code. For the remaining M−K diagnostic descriptions, the importance is considered to be zero.

For a code u_(n), the importance of a diagnostic description h_(m) to u_(n) is calculated as a_(nm)=u_(n) ^(T)h_(m). The scores {a_(nm)}_(m=1) ^(M) of all diagnostic descriptions are normalized into a probabilistic simplex using the softmax operation: ã_(nm)=exp(a_(nm))/Σ_(l=1) ^(M) exp(a_(nl)). Given these normalized importance scores {ã_(nm)}_(m=1) ^(M), the scores are used to weight the representations of diagnostic descriptions and obtain a single attentional vector of the M diagnostic descriptions: ĥ_(n)=Σ_(m=1) ^(M)ã_(nm)h_(m). Vectors ĥ_(n) and u_(n) are concatenated and a linear classifier is used to predict the probability that code n should be assigned: p_(n)=sigmoid(w_(n) ^(T)[ĥ_(n); u_(n)]+b_(n)), where the coefficients w_(n) and bias b_(n) are specific to code n.
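The soft-attention computation described above may be sketched as follows. The sketch assumes NumPy; the function name `soft_attention_probabilities`, the array shapes, and the random inputs are hypothetical. Subtracting the row maximum before exponentiation is a standard numerical safeguard that leaves the softmax result unchanged.

```python
import numpy as np

def soft_attention_probabilities(U, H, w, b):
    """U: (N, k) code vectors; H: (M, k) description vectors; w: (N, 2k); b: (N,)."""
    a = U @ H.T                                       # raw scores a_nm = u_n^T h_m
    a = a - a.max(axis=1, keepdims=True)              # numerical stabilization only
    a_tilde = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)
    h_hat = a_tilde @ H                               # attentional vector for each code
    features = np.concatenate([h_hat, U], axis=1)     # [h_hat_n ; u_n]
    logits = np.sum(w * features, axis=1) + b         # w_n^T [h_hat_n ; u_n] + b_n
    return 1.0 / (1.0 + np.exp(-logits)), a_tilde

rng = np.random.default_rng(0)
N, M, k = 3, 4, 5                                     # illustrative sizes
U = rng.standard_normal((N, k))
H = rng.standard_normal((M, k))
w = rng.standard_normal((N, 2 * k))
b = np.zeros(N)
p, a_tilde = soft_attention_probabilities(U, H, w, b)
print(p, a_tilde.sum(axis=1))
```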

The weight parameters Θ of the model are trained using the data of L patient visits. Θ includes the SLSTM weights W_(s), TLSTM weights W_(t), and weights W_(p) in the final prediction layer. Let c^((l))∈ℝ^(N) be a binary vector where c_(n) ^((l))=1 if the n-th code is assigned to this patient and c_(n) ^((l))=0 otherwise. Θ can be learned by minimizing the following prediction loss:

min_(Θ) ℒ_(pred)(Θ)=Σ_(l=1) ^(L)Σ_(n=1) ^(N) CE(p_(n) ^((l)), c_(n) ^((l))),  (Eq. 12)

where p_(n) ^((l)) is the predicted probability that code n is assigned to patient visit l and p_(n) ^((l)) is a function of Θ. CE(⋅, ⋅) is the cross-entropy loss.

Isotonic Constraints Module

Next, the importance order among ICD codes is incorporated. For the D^((l)) codes assigned to patient l, without loss of generality, the order is assumed to be 1≻2 . . . ≻D^((l)) (the order is given by human coders as ground truth in the MIMIC-III dataset). The predicted probability p_(i) ^((l)) (1≤i≤D^((l))) is used to characterize the importance of code i. To incorporate the order, an isotonic constraint is imposed on the probabilities, p₁ ^((l))≻p₂ ^((l)) . . . ≻p_(D) _((l)) ^((l)), and the following problem is solved:

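$\begin{matrix} {{\min_{\Theta}{\mathcal{L}_{pred}(\Theta)}} + {\max_{W_{d}}\left( {{- \lambda}{\mathcal{L}_{adv}\left( {W_{s},W_{d}} \right)}} \right)}} & \left( {{Eq}.\mspace{14mu} 13} \right) \\ {{s.t.\mspace{14mu} p_{1}^{(l)}} \succ {p_{2}^{(l)}\mspace{14mu} \ldots} \succ {p_{D^{(l)}}^{(l)}\mspace{14mu} {\forall l}} = 1,\ldots,L} & \; \end{matrix}$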

where the probabilities p_(i) ^((l)) are functions of Θ and λ is a tradeoff parameter.

An algorithm based on the alternating direction method of multipliers (ADMM) [52] is developed to solve the problem defined in Eq. 13. Let p^((l)) be a |D^((l))|-dimensional vector where the i-th element is p_(i) ^((l)). The problem is written into an equivalent form

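$\begin{matrix} {{\min_{\Theta}{\mathcal{L}_{pred}(\Theta)}} + {\max_{W_{d}}\left( {{- \lambda}{\mathcal{L}_{adv}\left( {W_{s},W_{d}} \right)}} \right)}} & \left( {{Eq}.\mspace{14mu} 14} \right) \\ {{s.t.\mspace{14mu} p^{(l)}} = q^{(l)},\mspace{14mu} q_{1}^{(l)} \succ {q_{2}^{(l)}\mspace{14mu} \ldots} \succ {q_{|D^{(l)}|}^{(l)}\mspace{14mu} {\forall l}} = 1,\ldots,L} & \; \end{matrix}$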

Then the augmented Lagrangian is written

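$\begin{matrix} {{\min_{\Theta,q,v}{\mathcal{L}_{pred}(\Theta)}} + {\max_{W_{d}}\left( {{- \lambda}{\mathcal{L}_{adv}\left( {W_{s},W_{d}} \right)}} \right)} + {\sum_{l = 1}^{L}\left( {{\langle{{p^{(l)} - q^{(l)}},v^{(l)}}\rangle} + {\frac{\rho}{2}{\|{p^{(l)} - q^{(l)}}\|}_{2}^{2}}} \right)}} & \left( {{Eq}.\mspace{14mu} 15} \right) \\ {{s.t.\mspace{14mu} q_{1}^{(l)}} \succ {q_{2}^{(l)}\mspace{14mu} \ldots} \succ {q_{|D^{(l)}|}^{(l)}\mspace{14mu} {\forall l}} = 1,\ldots,L} & \; \end{matrix}$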

This problem is solved by alternating between {p^((l))}_(l=1) ^(L), {q^((l))}_(l=1) ^(L) and {v^((l))}_(l=1) ^(L). The subproblem defined over q^((l)) is

$\begin{matrix} {{\min_{q^{(l)}}{- {\langle{q^{(l)},v^{(l)}}\rangle}}} + {\frac{\rho}{2}{\|{p^{(l)} - q^{(l)}}\|}_{2}^{2}}} & \left( {{Eq}.\mspace{14mu} 16} \right) \\ {{s.t.\mspace{14mu} q_{1}^{(l)}} \succ {q_{2}^{(l)}\mspace{14mu} \ldots} \succ q_{|D^{(l)}|}^{(l)}} & \; \end{matrix}$

which is an isotonic projection problem and can be solved via the algorithm proposed in Yao-Liang Yu and Eric P Xing. Exact algorithms for isotonic regression and related. In Journal of Physics: Conference Series, volume 699, page 012016. IOP Publishing, 2016. With {q^((l))}_(l=1) ^(L) and {v^((l))}_(l=1) ^(L) fixed, the sub-problem is min_(Θ) ℒ_(pred)(Θ)+max_(W_(d))(−λℒ_(adv)(W_(s), W_(d))), which can be solved using stochastic gradient descent (SGD). The update of v^((l)) is simple: v^((l))=v^((l))+ρ(p^((l))−q^((l))).
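For illustration, the q-update and v-update of the ADMM procedure may be sketched as below. Completing the square in Eq. 16 shows that its minimizer is the Euclidean projection of p^((l))+v^((l))/ρ onto the ordered set. The sketch uses the classical pool-adjacent-violators procedure as a stand-in for the cited algorithm, and relaxes the strict order to a non-increasing order; both are assumptions of this example. The function names `project_nonincreasing`, `q_update`, and `v_update` and the numeric values are hypothetical.

```python
def project_nonincreasing(y):
    """Euclidean projection of y onto {q : q[0] >= q[1] >= ...} by pooling adjacent violators."""
    blocks = []                                   # each block stores [sum, count]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block mean is smaller than the newest block mean.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   < blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    q = []
    for s, n in blocks:
        q.extend([s / n] * n)                     # every element of a block takes the block mean
    return q

def q_update(p, v, rho):
    """Minimizer of Eq. 16 under the relaxed (non-strict) order."""
    return project_nonincreasing([pi + vi / rho for pi, vi in zip(p, v)])

def v_update(p, q, v, rho):
    """Dual update v = v + rho * (p - q)."""
    return [vi + rho * (pi - qi) for pi, qi, vi in zip(p, q, v)]

p = [0.4, 0.7, 0.2]                               # predicted probabilities violating the order
v = [0.0, 0.0, 0.0]
rho = 1.0
q = q_update(p, v, rho)
v = v_update(p, q, v, rho)
print(q, v)
```

In the example the first two probabilities violate the order and are pooled to their mean, while the third is left unchanged.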

FIG. 4 is a block diagram of a computing device 400 that embodies the ICD coding system 100 of FIG. 1. The computing device 400 is specially configured to execute instructions related to the ICD code assignment process described above, including the application of machine-learned algorithms to diagnostic description records. Computers capable of being specially configured to execute such instructions may be in the form of a laptop, desktop, workstation, or other appropriate computers.

The computing device 400 includes a central processing unit (CPU) 402, a memory 404, e.g., random access memory, and a computer readable media 406 that stores program instructions that enable the CPU and memory to implement the functions of the diagnostic description encoding module 102 and the ICD code assignment module 104 of the ICD coding system 100 described above with reference to FIG. 1. The computing device 400 also includes a user interface 408 and a display 410, and an interface bus 412 that interconnects all components of the computing device.

Computer readable media 406 suitable for storing ICD coding system processing instructions include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, flash memory devices, magnetic disks, magneto optical disks and CD ROM and DVD-ROM disks. In operation, the CPU 402 and memory 404 execute the ICD coding system processing instructions stored in the computer readable media 406 to thereby perform the functions of the diagnostic description encoding module 102 and the ICD code assignment module 104.

The user interface 408, which may be a keyboard or a mouse, and the display 410 allow for a clinician to interface with the computing device 400. For example, a clinician seeking to obtain a set of ICD codes for a subject patient may input a diagnostic description record of the subject patient for processing. The clinician may then initiate execution of the ICD coding system processing instructions stored in the computer readable media 406 through the user interface 408, and await a display of the assigned ICD codes.

FIG. 5 is a schematic block diagram of an apparatus 500. The apparatus 500 may correspond to one or more processors configured to develop and train the machine-learned algorithm included in the ICD coding system of FIG. 1. The apparatus 500 may be embodied in any number of processor-driven devices, including, but not limited to, a server computer, a personal computer, one or more networked computing devices, an application-specific circuit, a minicomputer, a microcontroller, and/or any other processor-based device and/or combination of devices.

The apparatus 500 may include one or more processing units 502 configured to access and execute computer-executable instructions stored in at least one memory 504. The processing unit 502 may be implemented as appropriate in hardware, software, firmware, or combinations thereof. Software or firmware implementations of the processing unit 502 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described herein. The processing unit 502 may include, without limitation, a central processing unit (CPU), a digital signal processor (DSP), a reduced instruction set computer (RISC) processor, a complex instruction set computer (CISC) processor, a microprocessor, a microcontroller, a field programmable gate array (FPGA), a System-on-a-Chip (SOC), or any combination thereof. The apparatus 500 may also include a chipset (not shown) for controlling communications between the processing unit 502 and one or more of the other components of the apparatus 500. The processing unit 502 may also include one or more application-specific integrated circuits (ASICs) or application-specific standard products (ASSPs) for handling specific data processing functions or tasks.

The memory 504 may include, but is not limited to, random access memory (RAM), flash RAM, magnetic media storage, optical media storage, and so forth. The memory 504 may include volatile memory configured to store information when supplied with power and/or non-volatile memory configured to store information even when not supplied with power. The memory 504 may store various program modules, application programs, and so forth that may include computer-executable instructions that upon execution by the processing unit 502 may cause various operations to be performed. The memory 504 may further store a variety of data manipulated and/or generated during execution of computer-executable instructions by the processing unit 502.

The apparatus 500 may further include one or more interfaces 506 that may facilitate communication between the apparatus and one or more other apparatuses. For example, the interface 506 may be configured to receive records of diagnostic descriptions and records of ICD code descriptions. Communication may be implemented using any suitable communications standard. For example, a LAN interface may implement protocols and/or algorithms that comply with various communication standards of the Institute of Electrical and Electronics Engineers (IEEE), such as IEEE 802.11, while a cellular network interface may implement protocols and/or algorithms that comply with various communication standards of the Third Generation Partnership Project (3GPP) and 3GPP2, such as 3G and 4G (Long Term Evolution), and of the Next Generation Mobile Networks (NGMN) Alliance, such as 5G.

The memory 504 may include an operating system module (O/S) 508 that may be configured to manage hardware resources such as the interface 506 and provide various services to applications executing on the apparatus 500.

The memory 504 stores additional program modules such as: (1) a DD encoding module that receives diagnostic descriptions and generates latent representations of the diagnostic descriptions in the form of encoded DD vectors; (2) an ICD encoding module 512 that receives ICD codes and generates latent representations of the codes in the form of encoded ICD vectors; (3) an adversarial reconciliation module 514 that reconciles the different writing styles of diagnostic descriptions and ICD code descriptions; (4) an attention matching module that maps diagnostic descriptions to ICD codes; and (5) an isotonic constraints module 518 that establishes an order of importance for ICD codes. Each of these modules includes computer-executable instructions that when executed by the processing unit 502 cause various operations to be performed, such as the operations described above.

The apparatus 500 and modules disclosed herein may be implemented in hardware or software that is executed on a hardware platform. The hardware or hardware platform may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic component, discrete gate or transistor logic, discrete hardware components, or any combination thereof, or any other suitable component designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing components, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, or any other such configuration.

Evaluation

We performed the study on the publicly available MIMIC-III dataset, which contains de-identified electronic health records (EHRs) of 58,976 patient visits in the Beth Israel Deaconess Medical Center from 2001 to 2012. Each EHR has a clinical note called a discharge summary, which contains multiple sections of information, such as ‘discharge diagnosis’, ‘past medical history’, etc. From the ‘discharge diagnosis’ and ‘final diagnosis’ sections, we extract the diagnosis descriptions (DDs) written by physicians. Each DD is a short phrase or a sentence, articulating a certain disease or condition. Medical coders perform ICD coding mainly based on DDs. Following this practice, in this paper, we set the inputs of the automated coding model to be the DDs while acknowledging that other information in the EHRs is also valuable and is referred to by coders for code assignment. For simplicity, we leave the incorporation of non-DD information to future study.

Each patient visit is assigned a list of ICD codes, ranked in descending order of importance and relevance. For each visit, the number of codes is usually not equal to the number of diagnosis descriptions. These ground truth codes serve as the labels to train our coding model. The entire dataset contains 6,984 unique codes, each of which has a textual description, describing a disease, symptom, or condition. The codes are organized into a hierarchy where the top-level codes correspond to general diseases while the bottom-level ones represent specific diseases. In the code tree, children of a node represent subtypes of a disease. Table 1 below shows the diagnosis descriptions of a patient visit and the assigned ICD codes. Inside the parentheses are the descriptions of the codes. The codes are ranked according to descending importance.

TABLE 1

Diagnosis Descriptions
1. Prematurity at 35 4/7 weeks gestation
2. Twin number two of twin gestation
3. Respiratory distress secondary to transient tachypnea of the newborn
4. Suspicion for sepsis ruled out

Assigned ICD Codes
1. V31.00 (Twin birth, mate liveborn, born in hospital, delivered without mention of cesarean section)
2. 765.18 (Other preterm infants, 2,000-2,499 grams)
3. 775.6 (Neonatal hypoglycemia)
4. 770.6 (Transitory tachypnea of newborn)
5. V29.0 (Observation for suspected infectious condition)
6. V05.3 (Need for prophylactic vaccination and inoculation against viral hepatitis)

Experimental Settings: Out of the 6,984 unique codes, we selected the 2,833 codes that have the top frequencies to perform the study. We split the data into train/validation/test sets with 40k/7k/12k patient visits respectively. The hyperparameters were tuned on the validation set. The SLSTMs are bidirectional and dropout with 0.5 probability was used. The size of hidden states in all LSTMs was set to 100. The word embeddings were trained on the fly and their dimension was set to 200. The tradeoff parameter λ was set to 0.1. The parameter ρ in the ADMM algorithm was set to 1. In the SGD algorithm for solving min_(Θ) ℒ_(pred)(Θ)+max_(W_(d))(−λℒ_(adv)(W_(s), W_(d))), we used the Adam optimizer with an initial learning rate of 0.001 and a mini-batch size of 20. Sensitivity (true positive rate) and specificity (true negative rate) were used to evaluate the code assignment performance. We calculated these two scores for each individual code on the test set, then took a weighted (proportional to codes' frequencies) average across all codes. To evaluate the ranking performance of codes, we used the normalized discounted cumulative gain (NDCG).
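The evaluation metrics described above may be sketched as follows. The source does not specify how codes with no positive or no negative examples are handled, nor the exact gain and discount used for NDCG; the sketch skips such codes and uses a linear gain with a base-2 logarithmic discount, both of which are assumptions. The function names and the toy labels are hypothetical.

```python
import math

def weighted_sensitivity_specificity(y_true, y_pred):
    """y_true, y_pred: per-visit lists of 0/1 labels over the same N codes."""
    n_codes = len(y_true[0])
    sens_sum = spec_sum = weight_sum = 0.0
    for n in range(n_codes):
        t = [row[n] for row in y_true]
        p = [row[n] for row in y_pred]
        tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        tn = sum(1 for a, b in zip(t, p) if a == 0 and b == 0)
        pos, neg = sum(t), len(t) - sum(t)
        if pos == 0 or neg == 0:
            continue                              # score undefined for this code (assumption: skip)
        sens_sum += pos * tp / pos                # weight is the code frequency pos
        spec_sum += pos * tn / neg
        weight_sum += pos
    return sens_sum / weight_sum, spec_sum / weight_sum

def ndcg(ranked_relevances):
    """NDCG of a predicted ranking, given the relevance of each item in predicted order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked_relevances))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(ranked_relevances, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]         # four visits, two codes
y_pred = [[1, 0], [0, 1], [0, 1], [0, 1]]
sensitivity, specificity = weighted_sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity, ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```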

Ablation Study: We performed an ablation study to verify the effectiveness of each module in our model. To evaluate module X, we remove it from the model without changing other modules and denote such a baseline by No-X. The comparisons of No-X with the full model are given below in Table 2, which shows weighted sensitivity and specificity on the test set.

TABLE 2

Method              Sensitivity   Specificity
Larkey and Croft    0.15          0.17
Franz et al.        0.19          0.21
Pestian et al.      0.12          0.21
Kavuluru et al.     0.09          0.11
Kavuluru et al.     0.21          0.25
Koopman et al.      0.18          0.20
LET                 0.23          0.29
HierNet             0.26          0.30
HybridNet           0.25          0.31
BranchNet           0.25          0.29
No-TLSTM            0.23          0.28
Bottom-up TLSTM     0.27          0.31
No-AL               0.26          0.31
No-IC               0.24          0.29
No-AM               0.27          0.29
Our full model      0.29          0.33

On the first panel are baselines for holistic comparison. On the second panel are baselines compared in the ablation study of tree-of-sequences LSTM for capturing hierarchical relationship. On the third panel are baselines compared in the ablation study of adversarial learning for writing-style reconciliation, isotonic constraints for ranking, and attentional matching.

Tree-of-sequences LSTM: To evaluate this module, we compared with two configurations: (1) No-TLSTM, which removes the tree LSTM and directly uses the hidden states produced by the sequential LSTM as the final representations of codes; (2) Bottom-up TLSTM, which removes the hidden states generated by the top-down TLSTM. In addition, we compared with four hierarchical classification baselines including (1) hierarchical network (HierNet), (2) HybridNet, (3) branch network (BranchNet), (4) label embedding tree (LET), by using them to replace the bidirectional tree LSTM while keeping other modules untouched. Table 2 shows the average sensitivity and specificity scores achieved by these methods on the test set. We make the following observations. First, removing tree LSTM largely degrades performance: the sensitivity and specificity of No-TLSTM are 0.23 and 0.28 respectively while our full model (which uses bidirectional TLSTM) achieves 0.29 and 0.33 respectively. The reason is that No-TLSTM ignores the hierarchical relationship among codes. Second, the bottom-up tree LSTM alone performs less well than the bidirectional tree LSTM. This demonstrates the necessity of the top-down TLSTM, which ensures every two codes are connected by directed paths and can more expressively capture code-relations in the hierarchy. Third, our method outperforms the four baselines. The possible reason is that our method directly builds codes' hierarchical relationship into their representations while the baselines learn representations and capture hierarchical relationships separately.

Next, we present some qualitative results. For a patient (admission ID 147798) having a DD ‘E Coli urinary tract infection’, without using tree LSTM, two sibling codes, 585.2 (chronic kidney disease, stage II (mild)), which is the ground truth, and 585.4 (chronic kidney disease, stage IV (severe)), are simultaneously assigned, possibly because their textual descriptions are very similar (they differ only in the level of severity). This is incorrect because 585.2 and 585.4 are children of 585 (chronic kidney disease) and the severity level of this disease cannot simultaneously be mild and severe. After the tree LSTM is added, the false prediction of 585.4 is eliminated, which demonstrates the effectiveness of tree LSTM in incorporating one constraint induced by the code hierarchy: among the nodes sharing the same parent, only one should be selected.
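The sibling constraint described above can be checked after the fact on any set of assigned codes. The sketch below is only a diagnostic check and is not how the tree LSTM enforces the constraint (the model learns it through the code representations). The function name and the explicit parent map are hypothetical and introduced for illustration.

```python
from collections import defaultdict


def sibling_conflicts(assigned, parent_of):
    """Return {parent: [codes]} for every parent with more than one assigned child.

    assigned: iterable of assigned code strings.
    parent_of: hypothetical dict mapping a code to its parent in the hierarchy.
    """
    by_parent = defaultdict(list)
    for code in assigned:
        parent = parent_of.get(code)
        if parent is not None:
            by_parent[parent].append(code)
    # Keep only the groups that violate "at most one child per parent".
    return {p: sorted(cs) for p, cs in by_parent.items() if len(cs) > 1}
```

For the example above, a parent map containing 585.2 and 585.4 under 585 would flag the pair when both are assigned and return an empty result once 585.4 is dropped.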

For patient 197205, No-TLSTM assigns the following codes: 462 (subacute sclerosing panencephalitis), 790.29 (other abnormal glucose), 799.9 (unspecified viral infection), and 285.21 (anemia in chronic kidney disease). Among these codes, the first three are the ground truth and the fourth one is incorrect (the ground truth is 401.9 (unspecified essential hypertension)). Adding tree LSTM fixes this error. The average distance between 401.9 and the rest of ground truth codes is 6.2. For the incorrectly assigned code 285.21, such a distance is 7.9. This demonstrates that tree LSTM is able to capture another constraint imposed by the hierarchy: codes with smaller tree-distance are more likely to be assigned together.
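The tree-distance comparison above can be sketched as follows. The sketch assumes that the distance between two codes is the number of edges on the path joining them through their lowest common ancestor; the text does not define the distance, so this definition, the function names, and the toy hierarchy used in the usage note are assumptions.

```python
def path_to_root(code, parent_of):
    """List the nodes from code up to the root, inclusive."""
    path = [code]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def tree_distance(a, b, parent_of):
    """Number of edges between a and b via their lowest common ancestor."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(a, parent_of))}
    for j, node in enumerate(path_to_root(b, parent_of)):
        if node in steps_from_a:
            return steps_from_a[node] + j
    raise ValueError("codes do not share a root")


def average_distance(code, others, parent_of):
    """Mean tree distance from code to each code in others."""
    return sum(tree_distance(code, o, parent_of) for o in others) / len(others)
```

With a toy hierarchy in which A1 and A2 are children of A, B1 is a child of B, and A and B are children of a common root, the distance from A1 to A2 is 2 and the distance from A1 to B1 is 4, so a candidate code in the same branch as the other assigned codes has the smaller average distance.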

Adversarial learning: To evaluate the efficacy of adversarial learning (AL), we remove it from the full model and refer to this baseline as No-AL. Specifically, in Eq. 13, the loss term $\max_{W_d}\left(-\mathcal{L}_{adv}(W_s, W_e)\right)$ is taken away. Table 2 shows the results, from which we observe that after AL is removed, the sensitivity and specificity drop from 0.29 and 0.33 to 0.26 and 0.31 respectively. No-AL does not reconcile the different writing styles of diagnosis descriptions (DDs) and code descriptions (CDs). As a result, a DD and a CD that have similar semantics may be mismatched because their writing styles are different. For example, a patient (admission ID 147583) has a DD ‘h/o DVT on anticoagulation’, which contains the abbreviation DVT (deep vein thrombosis). Due to the presence of this abbreviation, it is difficult to assign a proper code to this DD since the textual descriptions of codes do not contain abbreviations. With adversarial learning, our model can correctly map this DD to a ground truth code: 443.9 (peripheral vascular disease, unspecified). Without AL, this code is not selected. As another example, a DD ‘coronary artery disease, STEMI, s/p 2 stents placed in RCA’ was given to patient 148532. This DD is written informally and ungrammatically, and contains highly detailed information, e.g., ‘s/p 2 stents placed in RCA’. Such a writing style is quite different from that of CDs. With AL, our model successfully matches this DD to a ground truth code: 414.01 (coronary atherosclerosis of native coronary artery). In contrast, No-AL fails to achieve this.

Isotonic constraint (IC): To evaluate this ingredient, we remove the ICs from Eq. 13 during training and denote this baseline as No-IC. We used NDCG to measure the ranking performance, which is calculated in the following way. Consider a testing patient-visit l whose ground truth ICD codes are $\mathcal{M}^{(l)}$. For any code c, we define the relevance score of c to l as 0 if $c \notin \mathcal{M}^{(l)}$, and as $|\mathcal{M}^{(l)}| - r(c)$ otherwise, where r(c) is the ground truth rank of c in $\mathcal{M}^{(l)}$. We rank codes in descending order of their corresponding prediction probabilities and obtain the predicted rank for each code. We calculated the NDCG scores at positions 2, 4, 6, and 8 based on the relevance scores and predicted ranks, which are shown in Table 3:

TABLE 3

Position   2      4      6      8
----------------------------------
No-IC      0.27   0.26   0.23   0.20
IC         0.32   0.29   0.27   0.23

As can be seen, using IC achieves much higher NDCG than No-IC, which demonstrates the effectiveness of IC in capturing the importance order among codes.
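The NDCG calculation described above can be sketched as follows. The relevance definition (zero for codes outside the ground truth, and the number of ground truth codes minus the ground truth rank otherwise) follows the text. The discount 1/log2(position + 1), the linear gain, and the choice that ranks start at 1 (exposed as the hypothetical parameter rank_start) are assumptions, since the text does not state them.

```python
import math


def relevance_scores(truth_ranked, rank_start=1):
    """Map each ground-truth code to |M| - r(c); truth_ranked is most important first."""
    m = len(truth_ranked)
    return {c: m - r for r, c in enumerate(truth_ranked, start=rank_start)}


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount (assumed form)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg_at_k(probabilities, truth_ranked, k, rank_start=1):
    """NDCG at position k for one visit.

    probabilities: dict mapping each code to its predicted probability.
    """
    rel = relevance_scores(truth_ranked, rank_start)
    # Predicted ranking: codes in descending order of prediction probability.
    predicted = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    gains = [rel.get(c, 0) for c in predicted]      # 0 for codes not in the truth
    ideal = sorted(rel.values(), reverse=True)[:k]  # best achievable ordering
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Note that with ranks starting at 1 the least important ground truth code receives relevance 0; passing rank_start=0 gives every ground truth code a positive relevance instead.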

We also evaluated how IC affects the sensitivity and specificity of code assignment. As can be seen from Table 2, No-IC degrades the two scores from 0.29 and 0.33 to 0.24 and 0.29 respectively, which indicates that IC is helpful in training a model that can more correctly assign codes. This is because IC encourages codes that are highly relevant to the patients to be ranked at top positions, which prevents the selection of irrelevant codes.

Attentional matching (AM): In the evaluation of this module, we compared with a baseline—No-AM, which performs an unweighted average of the M DDs:

${{\hat{h}}_{n} = {\frac{1}{M}{\sum\limits_{m = 1}^{M}h_{m}}}},$

concatenates ĥ_(n) with u_(n), and feeds the concatenated vector into the final prediction layer. From Table 2, we can see our full model (with AM) outperforms No-AM, which demonstrates the effectiveness of attentional matching. In determining whether a code should be assigned, different DDs have different importance weights. No-AM ignores such weights and therefore performs less well.
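The No-AM baseline described above can be sketched in a few lines. The sketch uses plain Python lists for the hidden states; the function names are hypothetical. The weighted variant takes its weights as an input and only illustrates how an importance-weighted combination differs from the unweighted one; it does not reproduce how the full model computes its attention weights.

```python
def unweighted_average(dd_states):
    """Average M hidden states h_1..h_M element-wise, as in the No-AM baseline."""
    m = len(dd_states)
    return [sum(column) / m for column in zip(*dd_states)]


def weighted_average(dd_states, weights):
    """Combine hidden states with given importance weights (assumed to sum to 1)."""
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*dd_states)]


def no_am_features(dd_states, code_vector):
    """Concatenate the averaged DD representation with the code representation u_n."""
    return unweighted_average(dd_states) + list(code_vector)
```

With uniform weights 1/M the weighted form reduces to the No-AM average, so the baseline corresponds to treating every DD as equally important for every code.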

AM can correctly perform the many-to-one mapping from multiple DDs to a CD. For example, patient 190236 was given two DDs: ‘renal insufficiency’ and ‘acute renal failure’. AM maps them to a combined ICD code: 403.91 (hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease), which is in the ground truth provided by medical coders. In contrast, No-AM fails to assign this code. AM is also able to correctly map a single DD to multiple CDs. For example, a DD ‘congestive heart failure, diastolic’ was given to patient 140851. AM successfully maps this DD to two codes: (1) 428.0 (congestive heart failure, unspecified); (2) 428.30 (diastolic heart failure, unspecified). Without AM, this DD is mapped only to 428.0.

Holistic comparison with other baselines: In addition to evaluating the four modules individually, we also compared our full model with other baselines previously proposed for ICD coding, which are listed in the first panel of Table 2. As can be seen from Table 2, our approach achieves higher sensitivity and specificity scores than all of them. The reason that our model works better is two-fold. First, our model is based on deep neural networks, which arguably have better modeling power than the linear methods used in the baselines. Second, our model is able to capture the hierarchical relationship and importance order among codes, can alleviate the discrepancy in writing styles, and allows flexible many-to-one and one-to-many mappings from DDs to CDs. These merits are not possessed by the baselines.

Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. The software may reside on a computer-readable medium. A computer-readable medium may include, by way of example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk (e.g., compact disk (CD), digital versatile disk (DVD)), a smart card, a flash memory device (e.g., card, stick, key drive), random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a general register, or any other suitable non-transitory medium for storing software.

While various embodiments have been described above, they have been presented by way of example only, and not by way of limitation. Likewise, the various diagrams may depict an example architectural or other configuration for the disclosure, which is done to aid in understanding the features and functionality that can be included in the disclosure. The disclosure is not restricted to the illustrated example architectures or configurations, but can be implemented using a variety of alternative architectures and configurations.

In this document, the terms “module” and “engine”, as used herein, refer to software, firmware, hardware, and any combination of these elements for performing the associated functions described herein. Additionally, for purpose of discussion, the various modules are described as discrete modules; however, as would be apparent to one of ordinary skill in the art, two or more modules may be combined to form a single module that performs the associated functions according to embodiments of the invention.

In this document, the terms “computer program product”, “computer-readable medium”, and the like may be used generally to refer to media such as memory storage devices or storage units. These and other forms of computer-readable media may be involved in storing one or more instructions for use by a processor to cause the processor to perform specified operations. Such instructions, generally referred to as “computer program code” (which may be grouped in the form of computer programs or other groupings), when executed, enable the computing system to perform the specified operations.

Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing: the term “including” should be read as meaning “including, without limitation” or the like; the term “example” is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof; and adjectives such as “conventional,” “traditional,” “normal,” “standard,” “known”, and terms of similar meaning, should not be construed as limiting the item described to a given time period, or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future.

Additionally, memory or other storage, as well as communication components, may be employed in embodiments of the invention. It will be appreciated that, for clarity purposes, the above description has described embodiments of the invention with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processing logic elements or domains may be used without detracting from the invention. For example, functionality illustrated to be performed by separate processing logic elements or controllers may be performed by the same processing logic element or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.

Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by, for example, a single unit or processing logic element. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined. The inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather the feature may be equally applicable to other claim categories, as appropriate.

The various aspects of this disclosure are provided to enable one of ordinary skill in the art to practice the present invention. Various modifications to exemplary embodiments presented throughout this disclosure will be readily apparent to those skilled in the art. Thus, the claims are not intended to be limited to the various aspects of this disclosure, but are to be accorded the full scope consistent with the language of the claims. All structural and functional equivalents to the various components of the exemplary embodiments described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for.” 

What is claimed is:
 1. A method of assigning a set of international classification of diseases (ICD) codes to a patient, the method comprising: obtaining a diagnostic description vector from at least one diagnostic description record of the patient; and applying a machine-learned ICD code assignment algorithm to the diagnostic description vector to assign a set of ICD codes to the patient.
 2. The method of claim 1, wherein obtaining a diagnostic description vector comprises processing the at least one diagnostic description record with a long short-term memory (LSTM) recurrent neural network.
 3. The method of claim 2, wherein the diagnostic description vector comprises a plurality of hidden representations, each corresponding to a diagnostic description in the at least one diagnostic description record, and processing comprises obtaining the plurality of hidden representations using each of a character-level LSTM and a word-level LSTM.
 4. The method of claim 2, wherein the LSTM is a sequential LSTM.
 5. The method of claim 1, wherein applying a machine-learned ICD code assignment algorithm to the diagnostic description vector comprises: for each of the diagnostic descriptions included in the diagnostic description vector, selecting one or more ICD codes to assign to the patient based on a mapping function maintained in the machine-learned ICD code assignment algorithm that maps diagnostic descriptions to one or more ICD codes.
 6. The method of claim 5, wherein the mapping function maintained in the machine-learned ICD code assignment algorithm maps a diagnostic description to one or more ICD codes based on importance scores between the diagnostic description and a plurality of ICD codes.
 7. The method of claim 6, wherein the importance score of a diagnostic description to an ICD code is normalized across the plurality of ICD codes.
 8. The method of claim 6, wherein the plurality of ICD codes are included in an ICD vector obtained by processing at least one ICD code description record with a long short-term memory (LSTM) recurrent neural network.
 9. The method of claim 6, wherein the LSTM recurrent neural network is a tree-of-sequences LSTM.
 10. The method of claim 1, wherein the set of ICD codes assigned to the patient comprises a plurality of ICD codes, and further comprising applying an isotonic constraints algorithm to the plurality of ICD codes to obtain an order of importance among the plurality of ICD codes.
 11. The method of claim 10, wherein the isotonic constraints algorithm is based on an alternating direction method of multipliers.
 12. A system for assigning a set of international classification of diseases (ICD) codes to a patient, the system comprising: a diagnostic description encoding module configured to obtain a diagnostic description vector from at least one diagnostic description record of the patient; and an ICD code assignment module configured to apply a machine-learned ICD code assignment algorithm to the diagnostic description vector to assign a set of ICD codes to the patient.
 13. The system of claim 12, wherein the diagnostic description encoding module obtains a diagnostic description vector by being configured to process the at least one diagnostic description record with a long short-term memory (LSTM) recurrent neural network.
 14. The system of claim 12, wherein the ICD code assignment module is configured to, for each of the diagnostic descriptions included in the diagnostic description vector, select one or more ICD codes to assign to the patient based on a mapping function maintained in the machine-learned ICD code assignment algorithm that maps diagnostic descriptions to one or more ICD codes.
 15. A machine learning apparatus for generating a map between diagnostic descriptions and international classification of diseases (ICD) codes, the apparatus comprising: a processor; and a memory coupled to the processor, wherein the processor is configured to: generate representations of diagnostic descriptions in a form of diagnostic descriptions vectors; generate representations of ICD codes in a form of ICD vectors; process the diagnostic descriptions vectors and the ICD vectors to obtain an importance score between each diagnostic description represented in a diagnostic description vector and each ICD represented in an ICD vector, and associate each diagnostic description represented in the diagnostic description vector with one or more ICDs represented in the ICD vector based on the importance scores.
 16. The machine learning apparatus of claim 15, wherein the processor generates representations of diagnostic descriptions in the form of diagnostic descriptions vectors by processing at least one diagnostic description record with a long short-term memory (LSTM) recurrent neural network.
 17. The machine learning apparatus of claim 16, wherein the LSTM is a sequential LSTM.
 18. The machine learning apparatus of claim 15, wherein the processor generates representations of ICD codes in the form of ICD vectors by processing at least one ICD code record with a long short-term memory (LSTM) recurrent neural network.
 19. The machine learning apparatus of claim 18, wherein the LSTM is a tree-of-sequences LSTM.
 20. The machine learning apparatus of claim 15, wherein the processor is further configured to establish an order of importance for ICD codes in instances where a plurality of ICD codes are associated with a diagnostic description. 